Abstract

We start from a simple asymptotic result for the problem of on- 
line regression with the quadratic loss function: the class of continuous 
limited-memory prediction strategies admits a "leading prediction strategy",
which not only asymptotically performs at least as well as any con-
tinuous limited-memory strategy but also satisfies the property that the 
excess loss of any continuous limited-memory strategy is determined by 
how closely it imitates the leading strategy. More specifically, for any class 
of prediction strategies constituting a reproducing kernel Hilbert space we 
construct a leading strategy, in the sense that the loss of any prediction 
strategy whose norm is not too large is determined by how closely it im- 
itates the leading strategy. This result is extended to the loss functions 
given by Bregman divergences and by strictly proper scoring rules. 

1 Introduction 

Suppose $\mathcal{F}$ is a normed function class of prediction strategies (the
"benchmark class"). It is well known that, under some restrictions on
$\mathcal{F}$, there exists a "master prediction strategy" (sometimes also
called a "universal strategy") that performs almost as well as the best
strategies in $\mathcal{F}$ whose norm is not too large (see, e.g., [?]). The
"leading prediction strategies" constructed in this paper satisfy a stronger
property: the loss of any prediction strategy in $\mathcal{F}$ whose norm is not
too large exceeds the loss of a leading strategy by the divergence between the
predictions output by the two prediction strategies. Therefore, the leading
strategy implicitly serves as a standard for prediction strategies $F$ in
$\mathcal{F}$ whose norm is not too large: such a prediction strategy $F$
suffers a small loss to the degree that its predictions resemble the leading
strategy's predictions, and the only way to compete with the leading strategy
is to imitate it.

We start the formal exposition with a simple asymptotic result (Proposition 1
in §2) asserting the existence of leading strategies in the problem of on-line
regression with the quadratic loss function for the class of continuous
limited-memory prediction strategies. To state a non-asymptotic version of this
result (Proposition 2) we introduce several general definitions that are used
throughout the paper. In the following two sections Proposition 2 is
generalized in two directions, to the loss functions given by Bregman
divergences and by strictly proper scoring rules (§§3 and 4). Competitive
on-line prediction typically avoids making any stochastic assumptions about the
way the observations are generated, but in §5 we consider, mostly for
comparison purposes, the case where observations are generated stochastically.
That section contains most of the references to the related literature,
although there are bibliographical remarks scattered throughout the paper. The
proofs are gathered in §6. The final section, §7, discusses possible directions
of further research.

There are many techniques for constructing master strategies, such as gra- 
dient descent, strong and weak aggregating algorithms, following the perturbed 
leader, defensive forecasting, to mention just a few. In this paper we will use 
defensive forecasting (proposed in [31] and based on [39, 15] and much earlier
work by Levin, Foster, and Vohra). The master strategies constructed using 
defensive forecasting automatically satisfy the stronger properties required of 
leading strategies; on the other hand, it is not clear whether leading strategies 
can be constructed using other techniques. 

2 On-line quadratic-loss regression 

Our general prediction protocol is: 

On-line prediction protocol 
FOR $n = 1, 2, \ldots$:
  Reality announces $x_n \in \mathbf{X}$.
  Predictor announces $\mu_n \in \mathbf{P}$.
  Reality announces $y_n \in \mathbf{Y}$.
END FOR.

At the beginning of each round $n$ Predictor is given some side information
$x_n$ relevant to predicting the following observation $y_n$, after which he
announces his prediction $\mu_n$. The side information is taken from the
information space $\mathbf{X}$, the observations from the observation space
$\mathbf{Y}$, and the predictions from the prediction space $\mathbf{P}$. The
error of prediction is measured by a loss function
$\lambda : \mathbf{Y} \times \mathbf{P} \to \mathbb{R}$, so that
$\lambda(y_n, \mu_n)$ is the loss suffered by Predictor on round $n$.

A prediction strategy is a strategy for Predictor in this protocol. More
explicitly, each prediction strategy $F$ maps each sequence

$$ s = (x_1, y_1, \ldots, x_{n-1}, y_{n-1}, x_n) \in \mathbf{S} := \bigcup_{n=1}^{\infty} (\mathbf{X} \times \mathbf{Y})^{n-1} \times \mathbf{X} \tag{1} $$

to a prediction $F(s) \in \mathbb{R}$; we will call $\mathbf{S}$ the situation
space and its elements situations. We will sometimes use the notation

$$ s_n := (x_1, y_1, \ldots, x_{n-1}, y_{n-1}, x_n) \in \mathbf{S}, \tag{2} $$

where $x_i$ and $y_i$ are Reality's moves in the on-line prediction protocol.

In this section we will always assume that $\mathbf{Y} = [-Y, Y]$ for some
$Y > 0$, $[-Y, Y] \subseteq \mathbf{P} \subseteq \mathbb{R}$, and
$\lambda(y, \mu) = (y - \mu)^2$; in other words, we will consider the problem of
on-line quadratic-loss regression (with the observations bounded in absolute
value by a known constant $Y$).
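
The protocol and the quadratic loss can be sketched in a few lines of Python. This is only an illustration under stated assumptions: situations are stored as flat tuples, the side information and observations are plain floats, and the names `run_protocol` and `repeat_last_observation` are hypothetical (they do not come from the paper).

```python
from typing import Callable, List, Sequence, Tuple

# A situation s_n is stored as the flat tuple (x_1, y_1, ..., x_{n-1}, y_{n-1}, x_n).
Situation = Tuple[float, ...]
Strategy = Callable[[Situation], float]


def run_protocol(strategy: Strategy,
                 xs: Sequence[float],
                 ys: Sequence[float]) -> Tuple[List[float], float]:
    """Replay Reality's recorded moves; return predictions and cumulative quadratic loss."""
    history: List[float] = []
    predictions: List[float] = []
    cumulative_loss = 0.0
    for x, y in zip(xs, ys):
        situation = tuple(history) + (x,)   # s_n: everything Predictor has seen so far
        mu = strategy(situation)            # Predictor announces mu_n
        predictions.append(mu)
        cumulative_loss += (y - mu) ** 2    # Reality announces y_n; loss is (y_n - mu_n)^2
        history.extend([x, y])
    return predictions, cumulative_loss


def repeat_last_observation(situation: Situation) -> float:
    """Hypothetical strategy: predict y_{n-1}, or 0 on the first round."""
    return situation[-2] if len(situation) >= 3 else 0.0
```

For instance, `run_protocol(repeat_last_observation, [0.0, 0.0, 0.0], [1.0, 0.5, -0.5])` replays three rounds and accumulates the three squared errors.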

Asymptotic result 

Let $k$ be a positive integer. We say that a prediction strategy $F$ is order
$k$ Markov if $F(s_n)$ depends on (2) only via
$x_{\max(1,n-k)}, y_{\max(1,n-k)}, \ldots, x_{n-1}, y_{n-1}, x_n$.
More explicitly, $F$ is order $k$ Markov if and only if there exists a function

$$ f : (\mathbf{X} \times \mathbf{Y})^k \times \mathbf{X} \to \mathbf{P} $$

such that, for all $n > k$ and all (2),

$$ F(s_n) = f(x_{n-k}, y_{n-k}, \ldots, x_{n-1}, y_{n-1}, x_n). $$

A limited-memory prediction strategy is a prediction strategy which is order
$k$ Markov for some $k$. (The expression "Markov strategy" is reserved for
"order 0 Markov strategy".)

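Proposition 1 Let $\mathbf{Y} = \mathbf{P} = [-Y, Y]$ and $\mathbf{X}$ be a
compact metric space. There exists a strategy for Predictor that guarantees

$$ \frac{1}{N} \sum_{n=1}^{N} (y_n - \mu_n)^2 + \frac{1}{N} \sum_{n=1}^{N} (\mu_n - \phi_n)^2 - \frac{1}{N} \sum_{n=1}^{N} (y_n - \phi_n)^2 \to 0 \tag{3} $$

as $N \to \infty$ for the predictions $\phi_n$ output by any continuous
limited-memory prediction strategy.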

The strategy whose existence is asserted by Proposition 1 is a leading strategy
in the sense discussed in §1: the average loss of a continuous limited-memory
strategy $F$ is determined by how well it manages to imitate the leading
strategy. Moreover, once we know the predictions made by $F$ and by the leading
strategy, we can find the excess loss of $F$ over the leading strategy without
needing to know the actual observations.
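
The order $k$ Markov strategies defined earlier in this subsection can be illustrated with a small wrapper. This is a sketch under assumptions: situations are flat tuples $(x_1, y_1, \ldots, x_n)$ of floats, the helper name `order_k_markov` is hypothetical, and the value returned on the early rounds $n \le k$ (where the definition imposes no constraint) is an arbitrary default.

```python
from typing import Callable, Tuple

Situation = Tuple[float, ...]  # flat tuple (x_1, y_1, ..., x_{n-1}, y_{n-1}, x_n)


def order_k_markov(f: Callable[..., float], k: int, default: float = 0.0):
    """Turn f : (X x Y)^k x X -> P into an order k Markov prediction strategy."""
    def strategy(situation: Situation) -> float:
        n = (len(situation) + 1) // 2        # round number, since len(s_n) = 2n - 1
        if n <= k:
            return default                   # the definition constrains only rounds n > k
        window = situation[-(2 * k + 1):]    # (x_{n-k}, y_{n-k}, ..., y_{n-1}, x_n)
        return f(*window)
    return strategy


# Hypothetical order 1 example: previous observation plus the current signal.
strategy = order_k_markov(lambda x_prev, y_prev, x_now: y_prev + x_now, k=1)
```

Two situations that agree on their last $2k + 1$ entries receive the same prediction, which is exactly the limited-memory property.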

Leading strategies for reproducing kernel Hilbert spaces 

In this subsection we will state a non-asymptotic version of Proposition 1.
Since $\mathbf{P} = \mathbb{R}$ is a vector space, the sum of two prediction
strategies and the product of a scalar (i.e., real number) and a prediction
strategy can be defined pointwise:

$$ (F_1 + F_2)(s) := F_1(s) + F_2(s), \quad (cF)(s) := cF(s), \quad s \in \mathbf{S}. $$

Let $\mathcal{F}$ be a Hilbert space of prediction strategies (with the
pointwise operations of addition and multiplication by scalar). Its embedding
constant $c_{\mathcal{F}}$ is defined by

$$ c_{\mathcal{F}} := \sup_{s \in \mathbf{S}} \ \sup_{F \in \mathcal{F} : \|F\|_{\mathcal{F}} \le 1} |F(s)|. \tag{4} $$

We will be interested in the case $c_{\mathcal{F}} < \infty$ and will refer to
$\mathcal{F}$ satisfying this condition as reproducing kernel Hilbert spaces
(RKHS) with finite embedding constant. (More generally, $\mathcal{F}$ is said
to be an RKHS if the internal supremum in (4) is finite for each
$s \in \mathbf{S}$.) In our informal discussions we will be assuming that
$c_{\mathcal{F}}$ is a moderately large constant.
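
For an RKHS with reproducing kernel $K$, the inner supremum in (4) equals $\sqrt{K(s, s)}$ (a standard consequence of the Cauchy-Schwarz inequality applied to the representer of $s$), so the embedding constant can be estimated from the diagonal of the kernel. The sketch below makes several assumptions that are not in the paper: situations are replaced by real numbers, the two kernels are arbitrary illustrative choices, and the supremum is taken over a finite grid, which in general gives only a lower estimate.

```python
import math


def embedding_constant(kernel, points):
    """Estimate c = sup_s sqrt(K(s, s)) over a finite set of points."""
    return max(math.sqrt(kernel(s, s)) for s in points)


def gaussian_kernel(s, t, width=1.0):
    """Gaussian kernel; its diagonal is identically 1."""
    return math.exp(-((s - t) ** 2) / (2 * width ** 2))


def quadratic_kernel(s, t):
    """Polynomial kernel of degree 2; its diagonal is (1 + s^2)^2."""
    return (1.0 + s * t) ** 2


# Grid on [-2, 2], standing in for a bounded set of situations.
grid = [-2.0 + 4.0 * i / 400 for i in range(401)]
```

On this grid the Gaussian kernel gives embedding constant 1, while the quadratic kernel gives $1 + 2^2 = 5$, attained at the endpoints.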

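Proposition 2 Let $\mathbf{Y} = [-Y, Y]$, $\mathbf{P} = \mathbb{R}$, and
$\mathcal{F}$ be an RKHS of prediction strategies with finite embedding
constant $c_{\mathcal{F}}$. There exists a strategy for Predictor that
guarantees

$$ \left| \sum_{n=1}^{N} (y_n - \mu_n)^2 + \sum_{n=1}^{N} (\mu_n - \phi_n)^2 - \sum_{n=1}^{N} (y_n - \phi_n)^2 \right| \le 2Y \sqrt{c_{\mathcal{F}}^2 + 1} \, \left( \|F\|_{\mathcal{F}} + Y \right) \sqrt{N}, \quad \forall N \in \{1, 2, \ldots\} \ \forall F \in \mathcal{F}, \tag{5} $$

where $\phi_n$ are $F$'s predictions, $\phi_n := F(s_n)$.

For an $F$ whose norm is not too large (i.e., $F$ satisfying
$\|F\|_{\mathcal{F}} \ll N^{1/2}$), (5) shows that

$$ \frac{1}{N} \sum_{n=1}^{N} (y_n - \phi_n)^2 \approx \frac{1}{N} \sum_{n=1}^{N} (y_n - \mu_n)^2 + \frac{1}{N} \sum_{n=1}^{N} (\mu_n - \phi_n)^2. $$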

Proposition 1 is obtained by applying Proposition 2 to large ("universal")
RKHS. The details will be given in §6, and here we will only demonstrate this
idea with a simple but non-trivial example. Let $k$ and $m$ be positive integer
constants such that $m > k/2$. A prediction strategy $F$ will be included in
$\mathcal{F}$ if its predictions $\phi_n$ satisfy

$$ \phi_n = \begin{cases} 0 & \text{if } n \le k \\ f(y_{n-k}, \ldots, y_{n-1}) & \text{otherwise,} \end{cases} $$

where $f$ is a function from the Sobolev space $W^{m,2}([-Y, Y]^k)$ (see, e.g.,
[?] for the definition and properties of Sobolev spaces);
$\|F\|_{\mathcal{F}}$ is defined to be the Sobolev norm of $f$. Every
continuous function of $(y_{n-k}, \ldots, y_{n-1})$ can be arbitrarily well
approximated by functions in $W^{m,2}([-Y, Y]^k)$, and so $\mathcal{F}$ is a
suitable class of prediction strategies if we believe that neither
$x_1, \ldots, x_n$ nor $y_1, \ldots, y_{n-k-1}$ are useful in predicting $y_n$.







Very large benchmark classes 



Some interesting benchmark classes of prediction strategies are too large to
equip with the structure of RKHS [23]. However, an analogue of Proposition 2
can also be proved for some Banach spaces $\mathcal{F}$ of prediction
strategies (with the pointwise operations of addition and multiplication by
scalar) for which the constant $c_{\mathcal{F}}$ defined by (4) is finite. The
modulus of convexity of a Banach space $U$ is defined as the function

$$ \delta_U(\epsilon) := \inf_{u, v \in S_U, \ \|u - v\|_U = \epsilon} \left( 1 - \left\| \frac{u + v}{2} \right\|_U \right), \quad \epsilon \in (0, 2], $$

where $S_U := \{u \in U \mid \|u\|_U = 1\}$ is the unit sphere in $U$.

The existence of leading strategies (in a somewhat weaker sense than in
Proposition 2) is asserted in the following result.

Proposition 3 Let $\mathbf{Y} = [-Y, Y]$, $\mathbf{P} = \mathbb{R}$, and
$\mathcal{F}$ be a Banach space of prediction strategies having a finite
embedding constant $c_{\mathcal{F}}$ (see (4)) and satisfying

$$ \forall \epsilon \in (0, 2] : \ \delta_{\mathcal{F}}(\epsilon) \ge (\epsilon / 2)^p / p $$

for some $p \in [2, \infty)$. There exists a strategy for Predictor that
guarantees

$$ \left| \sum_{n=1}^{N} (y_n - \mu_n)^2 + \sum_{n=1}^{N} (\mu_n - \phi_n)^2 - \sum_{n=1}^{N} (y_n - \phi_n)^2 \right| \le 40 Y \sqrt{c_{\mathcal{F}}^2 + 1} \, \left( \|F\|_{\mathcal{F}} + Y \right) N^{1 - 1/p}, \quad \forall N \in \{1, 2, \ldots\} \ \forall F \in \mathcal{F}, \tag{6} $$

where $\phi_n$ are $F$'s predictions.

The example of a benchmark class of prediction strategies given after
Proposition 2 but with $f$ ranging over the Sobolev space
$W^{s,p}([-Y, Y]^k)$, $s > k/p$, is covered by this proposition. The parameter
$s$ describes the "degree of regularity" of the elements of $W^{s,p}$, and
taking sufficiently large $p$ we can reach arbitrarily irregular functions in
the Sobolev hierarchy.

3 Predictions evaluated by Bregman divergences

A predictable process is a function $F$ mapping the situation space
$\mathbf{S}$ to $\mathbb{R}$, $F : \mathbf{S} \to \mathbb{R}$. Notice that for
any function $\psi : \mathbf{P} \to \mathbb{R}$ and any prediction strategy $F$
the composition $\psi(F)$ (mapping each situation $s$ to $\psi(F(s))$) is a
predictable process; such compositions will be used in Theorems 1-3 below. A
Hilbert space $\mathcal{F}$ of predictable processes (with the usual pointwise
operations) is called an RKHS with finite embedding constant if (4) is finite.

The notion of Bregman divergence was introduced in [?], and is now widely used
in competitive on-line prediction (see, e.g., [?]). Suppose
$\mathbf{Y} = \mathbf{P} \subseteq \mathbb{R}$ (although it would be
interesting to extend Theorem 1 to the case where $\mathbb{R}$ is replaced by
any Euclidean, or even Hilbert, space). Let $\Psi$ and $\Psi'$ be two
real-valued functions defined on $\mathbf{Y}$. The expression

$$ d_{\Psi,\Psi'}(y, z) := \Psi(y) - \Psi(z) - \Psi'(z)(y - z), \quad y, z \in \mathbf{Y}, \tag{7} $$

is said to be the corresponding Bregman divergence if $d_{\Psi,\Psi'}(y, z) > 0$
whenever $y \ne z$. (Bregman divergence is usually defined for $y$ and $z$
ranging over a Euclidean space.) In all our examples $\Psi$ will be a strictly
convex continuously differentiable function and $\Psi'$ its derivative, in
which case we abbreviate $d_{\Psi,\Psi'}$ to $d_{\Psi}$.
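
Definition (7) translates directly into code. The following sketch (function names are illustrative, not from the paper) evaluates the Bregman divergence for the two convex functions used as examples in this section, the square and the negative entropy, and compares the results with the closed forms $(y - z)^2$ and the Kullback-Leibler divergence.

```python
import math


def bregman(psi, dpsi, y, z):
    """Bregman divergence d_{Psi,Psi'}(y, z) = Psi(y) - Psi(z) - Psi'(z) (y - z)."""
    return psi(y) - psi(z) - dpsi(z) * (y - z)


def square(y):
    return y * y


def d_square(y):
    return 2 * y


def neg_entropy(y):
    """Negative entropy, defined for y in (0, 1)."""
    return y * math.log(y) + (1 - y) * math.log(1 - y)


def d_neg_entropy(y):
    """Derivative of the negative entropy: the log likelihood ratio ln(y / (1 - y))."""
    return math.log(y / (1 - y))


def kl(y, z):
    """Relative entropy D(y || z) between Bernoulli parameters y and z."""
    return y * math.log(y / z) + (1 - y) * math.log((1 - y) / (1 - z))
```

For example, `bregman(neg_entropy, d_neg_entropy, 0.3, 0.8)` agrees with `kl(0.3, 0.8)` up to rounding.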

We will be using the standard notation

$$ \|f\|_{C(A)} := \sup_{y \in A} |f(y)|, $$

where $A$ is a subset of the domain of $f$.

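Theorem 1 Suppose $\mathbf{Y} = \mathbf{P}$ is a bounded subset of
$\mathbb{R}$. Let $\mathcal{F}$ be an RKHS of predictable processes with finite
embedding constant $c_{\mathcal{F}}$ and $\Psi$, $\Psi'$ be real-valued
functions on $\mathbf{Y} = \mathbf{P}$. There exists a strategy for Predictor
that guarantees, for all prediction strategies $F$ and $N = 1, 2, \ldots$,

$$ \left| \sum_{n=1}^{N} d_{\Psi,\Psi'}(y_n, \mu_n) + \sum_{n=1}^{N} d_{\Psi,\Psi'}(\mu_n, \phi_n) - \sum_{n=1}^{N} d_{\Psi,\Psi'}(y_n, \phi_n) \right| \le \mathrm{diam}(\mathbf{Y}) \sqrt{c_{\mathcal{F}}^2 + 1} \, \left( \|\Psi'(F)\|_{\mathcal{F}} + \|\Psi'\|_{C(\mathbf{Y})} \right) \sqrt{N}, \tag{8} $$

where $\phi_n$ are $F$'s predictions.

The expression $\|\Psi'(F)\|_{\mathcal{F}}$ in (8) is interpreted as $\infty$
when $\Psi'(F) \notin \mathcal{F}$; in this case (8) holds vacuously. Similar
conventions will be made in all following statements.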

Two of the most important Bregman divergences are obtained from the convex
functions $\Psi(y) := y^2$ and $\Psi(y) := y \ln y + (1 - y) \ln(1 - y)$
(negative entropy, defined for $y \in (0, 1)$); they are the quadratic loss
function

$$ d_{\Psi}(y, z) = (y - z)^2 \tag{9} $$

and the relative entropy (also known as the Kullback-Leibler divergence)

$$ d_{\Psi}(y, z) = D(y \,\|\, z) := y \ln \frac{y}{z} + (1 - y) \ln \frac{1 - y}{1 - z}, \tag{10} $$

respectively. If we apply Theorem 1 to them, (9) leads (assuming
$\mathbf{Y} = [-Y, Y]$) to a weaker version of Proposition 2, with the
right-hand side of (8) twice as large as that of (5), and (10) leads to the
following corollary.

Corollary 1 Let $\epsilon \in (0, 1/2)$,
$\mathbf{Y} = \mathbf{P} = [\epsilon, 1 - \epsilon]$, and the loss function be

$$ \lambda(y, \mu) = D(y \,\|\, \mu) $$

(defined in (10)). Let $\mathcal{F}$ be an RKHS of predictable processes with
finite embedding constant $c_{\mathcal{F}}$. There exists a strategy for
Predictor that guarantees, for all prediction strategies $F$,

$$ \left| \sum_{n=1}^{N} \lambda(y_n, \mu_n) + \sum_{n=1}^{N} \lambda(\mu_n, \phi_n) - \sum_{n=1}^{N} \lambda(y_n, \phi_n) \right| \le \sqrt{c_{\mathcal{F}}^2 + 1} \, \left( \left\| \ln \frac{F}{1 - F} \right\|_{\mathcal{F}} + \ln \frac{1 - \epsilon}{\epsilon} \right) \sqrt{N}, \quad \forall N \in \{1, 2, \ldots\}, $$

where $\phi_n$ are $F$'s predictions.

The log likelihood ratio $\ln \frac{F}{1 - F}$ appears because
$\Psi'(y) = \ln \frac{y}{1 - y}$ in this case.

Analogously to Proposition 2, Theorem 1 (as well as Theorems 2 and 3 in the
next section) can be easily generalized to Banach spaces of predictable
processes. One can also state asymptotic versions of Theorems 1-3 similar to
Proposition 1; and the continuous limited-memory strategies of Proposition 1
could be replaced by the equally interesting classes of continuous stationary
strategies (as in [34]) or Markov strategies (possibly discontinuous, as in
[23]). We will have to refrain from pursuing these developments in this paper.

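4 Predictions evaluated by strictly proper scoring rules

In this section we consider the case where $\mathbf{Y} = \{0, 1\}$ and
$\mathbf{P} \subseteq [0, 1]$. Every loss function
$\lambda : \mathbf{Y} \times \mathbf{P} \to \mathbb{R}$ will be extended to the
domain $[0, 1] \times \mathbf{P}$ by the formula

$$ \lambda(p, \mu) := p \lambda(1, \mu) + (1 - p) \lambda(0, \mu); $$

intuitively, $\lambda(p, \mu)$ is the expected loss of the prediction $\mu$
when the probability of $y = 1$ is $p$. Let us say that a loss function
$\lambda$ is a strictly proper scoring rule if

$$ \forall p, \mu \in \mathbf{P} : \ p \ne \mu \Longrightarrow \lambda(p, p) < \lambda(p, \mu) $$

(it is optimal to give the prediction equal to the true probability of $y = 1$
when the latter is known and belongs to $\mathbf{P}$). In this case the
function

$$ d_{\lambda}(\mu, \phi) := \lambda(\mu, \phi) - \lambda(\mu, \mu) $$

can serve as a measure of difference between predictions $\mu$ and $\phi$: it
is non-negative and is zero only when $\mu = \phi$. (Cf. [?], §4.)
The exposure of a loss function $\lambda$ is defined as

$$ \mathrm{Exp}_{\lambda}(\mu) := \lambda(1, \mu) - \lambda(0, \mu), \quad \mu \in \mathbf{P}. $$

Theorem 2 Let $\mathbf{Y} = \{0, 1\}$, $\mathbf{P} \subseteq [0, 1]$, $\lambda$
be a strictly proper scoring rule, and $\mathcal{F}$ be an RKHS of predictable
processes with finite embedding constant $c_{\mathcal{F}}$. There exists a
strategy for Predictor that guarantees, for all prediction strategies $F$ and
all $N = 1, 2, \ldots$,

$$ \left| \sum_{n=1}^{N} \lambda(y_n, \mu_n) + \sum_{n=1}^{N} d_{\lambda}(\mu_n, \phi_n) - \sum_{n=1}^{N} \lambda(y_n, \phi_n) \right| \le \frac{\sqrt{c_{\mathcal{F}}^2 + 1}}{2} \left( \|\mathrm{Exp}_{\lambda}(F)\|_{\mathcal{F}} + \|\mathrm{Exp}_{\lambda}\|_{C(\mathbf{P})} \right) \sqrt{N}, \tag{11} $$

where $\phi_n$ are $F$'s predictions.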
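
The notions used in this section (the extension of $\lambda$ to $[0, 1] \times \mathbf{P}$, the divergence $d_{\lambda}$, and the exposure) are easy to compute for concrete losses. The sketch below uses the quadratic and log loss functions discussed in this section; the function names and the finite grid used to spot-check strict propriety are illustrative assumptions, and a grid check is of course not a proof.

```python
import math


def expected_loss(loss, p, mu):
    """Extension lambda(p, mu) = p * lambda(1, mu) + (1 - p) * lambda(0, mu)."""
    return p * loss(1, mu) + (1 - p) * loss(0, mu)


def divergence(loss, mu, phi):
    """d_lambda(mu, phi) = lambda(mu, phi) - lambda(mu, mu)."""
    return expected_loss(loss, mu, phi) - expected_loss(loss, mu, mu)


def exposure(loss, mu):
    """Exp_lambda(mu) = lambda(1, mu) - lambda(0, mu)."""
    return loss(1, mu) - loss(0, mu)


def quadratic_loss(y, mu):
    return (y - mu) ** 2


def log_loss(y, mu):
    return -math.log(mu) if y == 1 else -math.log(1 - mu)


def is_strictly_proper_on(loss, grid):
    """Spot-check lambda(p, p) < lambda(p, mu) for all distinct p, mu in the grid."""
    return all(expected_loss(loss, p, p) < expected_loss(loss, p, mu)
               for p in grid for mu in grid if p != mu)


grid = [i / 20 for i in range(1, 20)]
```

With these definitions, `divergence(quadratic_loss, mu, phi)` reduces to $(\mu - \phi)^2$ and `divergence(log_loss, mu, phi)` to the relative entropy $D(\mu \,\|\, \phi)$.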



Two popular strictly proper scoring rules are the quadratic loss function
$\lambda(y, \mu) := (y - \mu)^2$ and the log loss function

$$ \lambda(y, \mu) := \begin{cases} -\ln \mu & \text{if } y = 1 \\ -\ln(1 - \mu) & \text{if } y = 0. \end{cases} $$

Applied to the quadratic loss function, Theorem 2 becomes essentially a special
case of Proposition 2. For the log loss function we have
$d_{\lambda}(\mu, \phi) = D(\mu \,\|\, \phi)$, and so we obtain the following
corollary.

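Corollary 2 Let $\epsilon \in (0, 1/2)$, $\mathbf{Y} = \{0, 1\}$,
$\mathbf{P} = [\epsilon, 1 - \epsilon]$, $\lambda$ be the log loss function, and
$\mathcal{F}$ be an RKHS of predictable processes with finite embedding
constant $c_{\mathcal{F}}$. There exists a strategy for Predictor that
guarantees, for all prediction strategies $F$,

$$ \left| \sum_{n=1}^{N} \lambda(y_n, \mu_n) + \sum_{n=1}^{N} D(\mu_n \,\|\, \phi_n) - \sum_{n=1}^{N} \lambda(y_n, \phi_n) \right| \le \frac{\sqrt{c_{\mathcal{F}}^2 + 1}}{2} \left( \left\| \ln \frac{F}{1 - F} \right\|_{\mathcal{F}} + \ln \frac{1 - \epsilon}{\epsilon} \right) \sqrt{N}, \quad \forall N \in \{1, 2, \ldots\}, $$

where $\phi_n$ are $F$'s predictions.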



A weaker version (with the bound twice as large) of Corollary 2 would be a
special case of Corollary 1 were it not for the restriction of the observation
space $\mathbf{Y}$ to $[\epsilon, 1 - \epsilon]$ in the latter. Using methods of
[31], it is even possible to get rid of the restriction
$\mathbf{P} = [\epsilon, 1 - \epsilon]$ in Corollary 2. Since the log loss
function plays a fundamental role in information theory (the cumulative loss
corresponds to the code length), we state this result as our next theorem.

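Theorem 3 Let $\mathbf{Y} = \{0, 1\}$, $\mathbf{P} = (0, 1)$, $\lambda$ be the
log loss function, and $\mathcal{F}$ be an RKHS of predictable processes with
finite embedding constant $c_{\mathcal{F}}$. There exists a strategy for
Predictor that guarantees, for all prediction strategies $F$,

$$ \left| \sum_{n=1}^{N} \lambda(y_n, \mu_n) + \sum_{n=1}^{N} D(\mu_n \,\|\, \phi_n) - \sum_{n=1}^{N} \lambda(y_n, \phi_n) \right| \le \frac{\sqrt{c_{\mathcal{F}}^2 + 1.8}}{2} \left( \left\| \ln \frac{F}{1 - F} \right\|_{\mathcal{F}} + 1 \right) \sqrt{N}, \quad \forall N \in \{1, 2, \ldots\}, $$

where $\phi_n$ are $F$'s predictions.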



5 Stochastic Reality and Jeffreys's law

In this section we revert to the quadratic regression framework of §2 and
assume $\mathbf{Y} = \mathbf{P} = [-Y, Y]$, $\lambda(y, \mu) = (y - \mu)^2$. (It
will be clear that similar results hold for Bregman divergences and strictly
proper scoring rules, but we stick to the simplest case since our main goal in
this section is to discuss the related literature.)

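Proposition 4 Suppose $\mathbf{Y} = \mathbf{P} = [-Y, Y]$. Let $F$ be a
prediction strategy and $y_n \in [-Y, Y]$ be generated as
$y_n := F(s_n) + \xi_n$ (remember that $s_n$ are defined by (2)), where the
noise random variables $\xi_n$ have expected value zero given $s_n$. For any
other prediction strategy $G$, any $N \in \{1, 2, \ldots\}$, and any
$\delta \in (0, 1)$,

$$ \left| \sum_{n=1}^{N} (y_n - \phi_n)^2 + \sum_{n=1}^{N} (\phi_n - \mu_n)^2 - \sum_{n=1}^{N} (y_n - \mu_n)^2 \right| \le 4 Y^2 \sqrt{2 \ln \frac{2}{\delta}} \, \sqrt{N} \tag{12} $$

with probability at least $1 - \delta$, where $\phi_n$ are $F$'s predictions
and $\mu_n$ are $G$'s predictions.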
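
A small simulation can make (12) concrete. Everything specific below is an assumption chosen for illustration and is not taken from the paper: the side information is dropped, the "true" strategy $F$ predicts half of the previous observation, the noise is uniform on $[-Y/2, Y/2]$ (which keeps $|y_n| \le Y$), and the competitor $G$ always predicts 0. A single run says nothing about the probability statement itself; it only shows how the two sides of (12) are computed.

```python
import math
import random


def jeffreys_gap(ys, phis, mus):
    """Left-hand side of (12) for observations ys, F's predictions phis, G's predictions mus."""
    a = sum((y - phi) ** 2 for y, phi in zip(ys, phis))
    b = sum((phi - mu) ** 2 for phi, mu in zip(phis, mus))
    c = sum((y - mu) ** 2 for y, mu in zip(ys, mus))
    return abs(a + b - c)


def bound_12(Y, N, delta):
    """Right-hand side of (12)."""
    return 4 * Y ** 2 * math.sqrt(2 * math.log(2 / delta)) * math.sqrt(N)


def simulate(N=1000, Y=1.0, seed=0):
    """Generate y_n = F(s_n) + xi_n under the illustrative assumptions described above."""
    rng = random.Random(seed)
    ys, phis, mus = [], [], []
    previous = 0.0
    for _ in range(N):
        phi = 0.5 * previous                  # hypothetical true strategy F
        xi = rng.uniform(-0.5 * Y, 0.5 * Y)   # zero-mean noise, keeps |y_n| <= Y
        y = phi + xi
        ys.append(y)
        phis.append(phi)
        mus.append(0.0)                       # hypothetical competitor G
        previous = y
    return ys, phis, mus
```

Calling `jeffreys_gap(*simulate())` and comparing it with `bound_12(1.0, 1000, 0.05)` shows how much slack the bound leaves in this toy setting.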

Combining Proposition 4 with Proposition 2 we obtain the following corollary.

Corollary 3 Suppose $\mathbf{Y} = \mathbf{P} = [-Y, Y]$. Let $\mathcal{F}$ be
an RKHS of prediction strategies with finite embedding constant
$c_{\mathcal{F}}$, $G$ be a prediction strategy whose predictions $\mu_n$ are
guaranteed to satisfy (5) (a "leading prediction strategy"), $F$ be a
prediction strategy in $\mathcal{F}$, and $y_n \in [-Y, Y]$ be generated as
$y_n := F(s_n) + \xi_n$, where the noise random variables $\xi_n$ have expected
value zero given $s_n$. For any $N \in \{1, 2, \ldots\}$ and any
$\delta \in (0, 1)$, the conjunction of

$$ \left| \sum_{n=1}^{N} (y_n - \mu_n)^2 - \sum_{n=1}^{N} (y_n - \phi_n)^2 \right| \le Y \sqrt{c_{\mathcal{F}}^2 + 1} \, \left( \|F\|_{\mathcal{F}} + Y \right) \sqrt{N} + 2 Y^2 \sqrt{2 \ln \frac{2}{\delta}} \, \sqrt{N} \tag{13} $$

and

$$ \sum_{n=1}^{N} (\phi_n - \mu_n)^2 \le Y \sqrt{c_{\mathcal{F}}^2 + 1} \, \left( \|F\|_{\mathcal{F}} + Y \right) \sqrt{N} + 2 Y^2 \sqrt{2 \ln \frac{2}{\delta}} \, \sqrt{N} \tag{14} $$

holds with probability at least $1 - \delta$, where $\phi_n$ are $F$'s
predictions and $\mu_n$ are $G$'s predictions.

We can see that if the "true" (in the sense of outputting the true
expectations) strategy $F$ belongs to the RKHS $\mathcal{F}$ and
$\|F\|_{\mathcal{F}}$ is not too large, not only will the loss of the leading
strategy be close to that of the true strategy, but their predictions will be
close as well.

Jeffreys's law 

In the rest of this section we will explain the connection of this paper with
the phenomenon widely studied in probability theory and the algorithmic theory
of randomness and dubbed "Jeffreys's law" by Dawid [?]. The general statement
of "Jeffreys's law" is that two successful prediction strategies produce
similar predictions (cf. [?], §5.2). To better understand this informal
statement, we first discuss two notions of success for prediction strategies.

As argued in [?], there are (at least) two very different kinds of predictions,
which we will call "S-predictions" and "D-predictions". Both S-predictions and
D-predictions are elements of $[-Y, Y]$ (in our current context), and the
prefixes "S-" and "D-" refer to the way in which we want to evaluate their
quality. S-predictions are Statements about Reality's behaviour, and they are
successful if they withstand attempts to falsify them; standard means of
falsification are statistical tests (see, e.g., [?], Chapter 3) and gambling
strategies ([?]; for a more recent exposition, see [21]). D-predictions do not
claim to be falsifiable statements about Reality; they are Decisions deemed
successful if they lead to a good cumulative loss.

As an example, let us consider the predictions $\phi_n$ and $\mu_n$ in
Proposition 4. The former are S-predictions; they can be rejected if (12) fails
to happen for a small $\delta$ (the complement of (12) can be used as the
critical region of a statistical test). The latter are D-predictions: we are
only interested in their cumulative loss. If $\phi_n$ are successful ((12)
holds for a moderately small $\delta$) and $\mu_n$ are successful (in the sense
of their cumulative loss being close to the cumulative loss of the successful
S-predictions $\phi_n$; this is the best that can be achieved as, by (12), the
latter cannot be much larger than the former), they will be close to each
other, in the sense $\sum_{n=1}^{N} (\phi_n - \mu_n)^2 \ll N$. We can see that
Proposition 4 implies a "mixed" version of Jeffreys's law, asserting the
proximity of S-predictions and D-predictions.

Similarly, Corollary 3 is also a mixed version of Jeffreys's law: it asserts
the proximity of the S-predictions $\phi_n$ (which are part of our falsifiable
model $y_n = \phi_n + \xi_n$) and the D-predictions $\mu_n$ (successful in the
sense of leading to a good cumulative loss; cf. (13)).

Proposition 2 immediately implies two "pure" versions of Jeffreys's law for
D-predictions:

• if a prediction strategy $F$ with $\|F\|_{\mathcal{F}}$ not too large
performs well, in the sense that its loss is close to the leading strategy's
loss, $F$'s predictions will be similar to the leading strategy's predictions;
more precisely,

$$ \sum_{n=1}^{N} (\phi_n - \mu_n)^2 \le \sum_{n=1}^{N} (y_n - \phi_n)^2 - \sum_{n=1}^{N} (y_n - \mu_n)^2 + 2Y \sqrt{c_{\mathcal{F}}^2 + 1} \, \left( \|F\|_{\mathcal{F}} + Y \right) \sqrt{N}; $$

• therefore, if two prediction strategies $F_1$ and $F_2$ with
$\|F_1\|_{\mathcal{F}}$ and $\|F_2\|_{\mathcal{F}}$ not too large perform well,
in the sense that their loss is close to the leading strategy's loss, their
predictions will be similar.

It is interesting that the leading strategy can be replaced by a master
strategy for the second version: if $F_1$ and $F_2$ gave very different
predictions and both performed almost as well as the master strategy, the mixed
strategy $(F_1 + F_2)/2$ would beat the master strategy; this immediately
follows from

$$ \left( \frac{\phi_1 + \phi_2}{2} - y \right)^2 = \frac{(\phi_1 - y)^2 + (\phi_2 - y)^2}{2} - \left( \frac{\phi_1 - \phi_2}{2} \right)^2, $$

where $\phi_1$ and $\phi_2$ are $F_1$'s and $F_2$'s predictions, respectively,
and $y$ is the observation.
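
The identity behind the mixed-strategy argument can be checked numerically. In the sketch below (function names are illustrative), the loss of the averaged prediction is compared with the average of the two losses minus the squared half-distance between the predictions; the further apart the two predictions are, the more the mixture gains.

```python
def mixed_loss(phi1, phi2, y):
    """Quadratic loss of the averaged prediction (phi1 + phi2) / 2."""
    return ((phi1 + phi2) / 2 - y) ** 2


def average_loss_minus_spread(phi1, phi2, y):
    """Average of the two quadratic losses minus ((phi1 - phi2) / 2)^2."""
    return ((phi1 - y) ** 2 + (phi2 - y) ** 2) / 2 - ((phi1 - phi2) / 2) ** 2


def mixture_gain(phi1, phi2, y):
    """How much the mixture improves on the average loss of the two strategies."""
    return ((phi1 - y) ** 2 + (phi2 - y) ** 2) / 2 - mixed_loss(phi1, phi2, y)
```

For any inputs, `mixture_gain(phi1, phi2, y)` equals `((phi1 - phi2) / 2) ** 2` up to rounding, independently of `y`.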

The usual versions of Jeffreys's law are, however, statements about
S-predictions. The quality of S-predictions is often evaluated using universal
statistical tests (as formalized by Martin-Löf [23]) or universal gambling
strategies (Levin [22], Schnorr [?]). For example, Theorem 7.1 of [?] and
Theorem 3 of [21] state that if two computable S-prediction strategies are both
successful, their predictions will asymptotically agree. Earlier, somewhat less
intuitive, statements of Jeffreys's law were given in terms of absolute
continuity of probability measures: see, e.g., [?] and [20]. Solomonoff [?]
proved a version of Jeffreys's law that holds "on average" (rather than for
individual sequences).

This paper is, to my knowledge, the first to state a version of Jeffreys's law
for D-predictions (although a step in this direction was made in Theorem 8 of
[?]).

6 Proofs 

In this section we prove, or give proof sketches of, Propositions 1-4 and
Theorems 1-3. Proposition 2 is a special case of Theorem 1 but its proof is more
intuitive and we give it separately (proving Proposition 3 along the way).

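Proof of Propositions 2 and 3

Noticing that

$$ \left| \sum_{n=1}^{N} (y_n - \mu_n)^2 + \sum_{n=1}^{N} (\mu_n - \phi_n)^2 - \sum_{n=1}^{N} (y_n - \phi_n)^2 \right| = 2 \left| \sum_{n=1}^{N} (\phi_n - \mu_n)(y_n - \mu_n) \right| \le 2 \left| \sum_{n=1}^{N} \mu_n (y_n - \mu_n) \right| + 2 \left| \sum_{n=1}^{N} \phi_n (y_n - \mu_n) \right|, \tag{15} $$

we can use the results of [32], §6, asserting the existence of a prediction
strategy producing predictions $\mu_n \in [-Y, Y]$ that satisfy

$$ \left| \sum_{n=1}^{N} \mu_n (y_n - \mu_n) \right| \le Y^2 \sqrt{c_{\mathcal{F}}^2 + 1} \, \sqrt{N} \tag{16} $$

(see (24) in [32]; this is a special case of good calibration) and

$$ \left| \sum_{n=1}^{N} \phi_n (y_n - \mu_n) \right| \le Y \sqrt{c_{\mathcal{F}}^2 + 1} \, \|F\|_{\mathcal{F}} \sqrt{N} \tag{17} $$

(see (25) in [32]; this is a special case of good resolution).

Replacing (16) and (17) with the corresponding statements for Banach function
spaces ([35], (52) and (53)) we obtain the proof of Proposition 3.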
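
The equality in (15) is purely algebraic and holds for arbitrary numbers, which is why the excess loss can be controlled through the two "calibration" and "resolution" sums alone. The following sketch (with hypothetical function names and arbitrary test data) checks the equality and the subsequent triangle inequality numerically.

```python
def excess_loss_gap(ys, mus, phis):
    """sum (y - mu)^2 + sum (mu - phi)^2 - sum (y - phi)^2, the quantity inside |.| in (15)."""
    return (sum((y - m) ** 2 for y, m in zip(ys, mus))
            + sum((m - p) ** 2 for m, p in zip(mus, phis))
            - sum((y - p) ** 2 for y, p in zip(ys, phis)))


def cross_term(ys, mus, phis):
    """2 * sum (phi - mu)(y - mu), whose absolute value is the middle expression in (15)."""
    return 2 * sum((p - m) * (y - m) for y, m, p in zip(ys, mus, phis))


def calibration_sum(ys, mus):
    """sum mu_n (y_n - mu_n), the sum bounded in (16)."""
    return sum(m * (y - m) for y, m in zip(ys, mus))


def resolution_sum(ys, mus, phis):
    """sum phi_n (y_n - mu_n), the sum bounded in (17)."""
    return sum(p * (y - m) for y, m, p in zip(ys, mus, phis))
```

Because `cross_term` is twice the difference of `resolution_sum` and `calibration_sum`, bounds on those two sums immediately bound the excess-loss gap.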

Remark In [31] we considered only prediction strategies $F$ for which $F(s_n)$
depends on $s_n$ (see (2)) only via $x_n$; in the terminology of this paper
these are (order 0) Markov strategies. It is easy to see that considering only
Markov strategies does not lead to a loss of generality: if we redefine the
object $x_n$ as $x_n := s_n$, any prediction strategy will become a Markov
prediction strategy.



Proof of Proposition 1

Proposition 1 will follow from the following lemma, proved (without stating it
explicitly) in [27] (proof of Theorem 2).

Lemma 1 ([27]) Let $\mathcal{G}$ be a separable set in $C(Z)$. There exists an
RKHS $\mathcal{F}$ on $Z$ with finite embedding constant such that
$\mathcal{F}$ is dense in $\mathcal{G}$ in metric $C(Z)$.

Proof Let $F_1, F_2, \ldots$ be a dense (in metric $C(Z)$) sequence of elements
of $\mathcal{G}$. Set

$$ \Phi_n(z) := \begin{cases} 2^{-n} \|F_n\|^{-1}_{C(Z)} F_n(z) & \text{if } F_n \ne 0 \\ 0 & \text{otherwise,} \end{cases} $$

$$ \Phi(z) := (\Phi_1(z), \Phi_2(z), \ldots) \in \ell^2 \ \text{ for } z \in Z, $$

$$ K(z, z') := \langle \Phi(z), \Phi(z') \rangle_{\ell^2}, \quad z, z' \in Z, $$

and let $\mathcal{F}$ be the unique RKHS with reproducing kernel $K$ (see the
Moore-Aronszajn theorem in [3], Theorem 2). It is clear that
$c_{\mathcal{F}}^2 = \sup_z K(z, z)$ is finite. By Lemma 2 below, each $F_n$
belongs to $\mathcal{F}$ since it can be represented as

$$ \left\langle 2^n \|F_n\|_{C(Z)} e_n, \Phi(\cdot) \right\rangle_{\ell^2}, $$

where $e_n \in \ell^2$ consists of all 0s except a 1 at the $n$th position.
Therefore, $\mathcal{F}$ is dense in $\mathcal{G}$. ∎

The following lemma was used in the proof. 

Lemma 2 Let $\Phi : Z \to H$, where $H$ is a Hilbert space. The RKHS
corresponding to the reproducing kernel
$K(z, z') := \langle \Phi(z), \Phi(z') \rangle_H$ consists of all functions
$\langle v, \Phi(\cdot) \rangle_H$, $v \in H$, with the inner product of
$\langle v, \Phi(\cdot) \rangle_H$ and $\langle v', \Phi(\cdot) \rangle_H$
equal to $\langle p(v), p(v') \rangle_H$, $p$ standing for the projection onto
the span of $\Phi(Z)$.

Proof By the Moore-Aronszajn theorem ([3], Theorem 2) there is a unique RKHS
with reproducing kernel $K$, so we only need to check that the function space
$\mathcal{F}$ defined in the statement of the lemma is an RKHS with $K$ as
reproducing kernel.

First we need to check that the inner product is well defined. This follows
from the obvious fact that the equality of the functions
$\langle v, \Phi(\cdot) \rangle_H$ and $\langle v', \Phi(\cdot) \rangle_H$ for
$v, v' \in \mathrm{span}(\Phi(Z))$ implies $v = v'$. The continuity of each
evaluation functional is also obvious.

The representer of $z \in Z$ is
$K_z(\cdot) := \langle \Phi(z), \Phi(\cdot) \rangle_H$ (in the sense that
$\langle K_z, f \rangle_{\mathcal{F}} = f(z)$ for each $f \in \mathcal{F}$) and
so the reproducing kernel $\langle K_z, K_{z'} \rangle_{\mathcal{F}}$ of
$\mathcal{F}$ indeed coincides with $K$. ∎

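Now we can easily deduce Proposition 1 from Proposition 2. The set of all
continuous order $k$ Markov strategies is a separable set in the Banach space
$C(\mathbf{S})$ of continuous prediction strategies with the sup metric (by
[?], Corollary 4.2.18). Therefore, the set $\mathcal{G}$ of all continuous
limited-memory strategies is separable in $C(\mathbf{S})$.

Let $\mathcal{F}$ be the RKHS whose existence is asserted by Lemma 1; we will
see that any strategy for Predictor satisfying (5) and $\mu_n \in [-Y, Y]$ will
satisfy (3) with $\phi_n$ output by a continuous limited-memory strategy $F$.
Indeed, for any $\epsilon > 0$ we can find $F^* \in \mathcal{F}$ that is
$\epsilon$-close in $C(\mathbf{S})$ to $F$. If $\phi_n$ are $F$'s predictions
and $\phi^*_n$ are $F^*$'s predictions, (5) implies that

$$ \left| \frac{1}{N} \sum_{n=1}^{N} (y_n - \mu_n)^2 + \frac{1}{N} \sum_{n=1}^{N} (\mu_n - \phi_n)^2 - \frac{1}{N} \sum_{n=1}^{N} (y_n - \phi_n)^2 \right| $$
$$ \le \left| \frac{1}{N} \sum_{n=1}^{N} (y_n - \mu_n)^2 + \frac{1}{N} \sum_{n=1}^{N} (\mu_n - \phi^*_n)^2 - \frac{1}{N} \sum_{n=1}^{N} (y_n - \phi^*_n)^2 \right| + 8 (Y + \epsilon) \epsilon $$
$$ \le 2Y \sqrt{c_{\mathcal{F}}^2 + 1} \, \left( \|F^*\|_{\mathcal{F}} + Y \right) \frac{1}{\sqrt{N}} + 8 (Y + \epsilon) \epsilon \le 10 (Y + \epsilon) \epsilon $$

from some $N$ on. Since $\epsilon$ can be taken arbitrarily small, we have (3).

Proof of Theorem 1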

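The proof is based on the generalized law of cosines

$$ d_{\Psi,\Psi'}(y, \phi) = d_{\Psi,\Psi'}(\mu, \phi) + d_{\Psi,\Psi'}(y, \mu) - \left( \Psi'(\phi) - \Psi'(\mu) \right) (y - \mu) \tag{18} $$

(which follows directly from the definition (7)). From (18) we deduce

$$ \left| \sum_{n=1}^{N} d_{\Psi,\Psi'}(y_n, \mu_n) + \sum_{n=1}^{N} d_{\Psi,\Psi'}(\mu_n, \phi_n) - \sum_{n=1}^{N} d_{\Psi,\Psi'}(y_n, \phi_n) \right| = \left| \sum_{n=1}^{N} \left( \Psi'(\phi_n) - \Psi'(\mu_n) \right) (y_n - \mu_n) \right| \le \left| \sum_{n=1}^{N} \Psi'(\mu_n) (y_n - \mu_n) \right| + \left| \sum_{n=1}^{N} \Psi'(\phi_n) (y_n - \mu_n) \right|. \tag{19} $$

From Theorem 3 in [38] we can see that there is a prediction strategy
guaranteeing

$$ \left| \sum_{n=1}^{N} \Psi'(\mu_n) (y_n - \mu_n) \right| \le \mathrm{diam}(\mathbf{Y}) \, \|\Psi'\|_{C(\mathbf{Y})} \sqrt{N} \tag{20} $$

and from Theorem 4 in [38] we can see that there is a prediction strategy
guaranteeing

$$ \left| \sum_{n=1}^{N} \Psi'(\phi_n) (y_n - \mu_n) \right| \le \mathrm{diam}(\mathbf{Y}) \, c_{\mathcal{F}} \, \|\Psi'(F)\|_{\mathcal{F}} \sqrt{N}. \tag{21} $$

We need, however, a single strategy guaranteeing some versions of (20) and
(21). Such a strategy can be obtained by merging a strategy guaranteeing (20)
and a strategy guaranteeing (21) (as in [38], Corollaries 3 and 4).
Setting

$$ \Phi(\mu, s) := \left( \frac{\Psi'(\mu)}{\|\Psi'\|_{C(\mathbf{Y})}}, \, k_s \right) \in \mathbb{R} \times \mathcal{F}, \quad \mu \in \mathbf{P}, \ s \in \mathbf{S}, \tag{22} $$

so that $c_{\Phi} \le \sqrt{c_{\mathcal{F}}^2 + 1}$, and letting $\mu_n$ be
output by the K29 algorithm based on (22), we obtain

$$ \left| \sum_{n=1}^{N} \Psi'(\mu_n) (y_n - \mu_n) \right| \le \|\Psi'\|_{C(\mathbf{Y})} \left\| \sum_{n=1}^{N} (y_n - \mu_n) \Phi(\mu_n, s_n) \right\|_{\mathbb{R} \times \mathcal{F}} \le \|\Psi'\|_{C(\mathbf{Y})} \, \mathrm{diam}(\mathbf{Y}) \sqrt{c_{\mathcal{F}}^2 + 1} \, \sqrt{N} \tag{23} $$

from Theorem 3 of [38], and we obtain

$$ \left| \sum_{n=1}^{N} (y_n - \mu_n) \left\langle k_{s_n}, \Psi'(F) \right\rangle_{\mathcal{F}} \right| = \left| \left\langle \sum_{n=1}^{N} (y_n - \mu_n) k_{s_n}, \Psi'(F) \right\rangle_{\mathcal{F}} \right| \le \|\Psi'(F)\|_{\mathcal{F}} \left\| \sum_{n=1}^{N} (y_n - \mu_n) k_{s_n} \right\|_{\mathcal{F}} \le \|\Psi'(F)\|_{\mathcal{F}} \, \mathrm{diam}(\mathbf{Y}) \sqrt{c_{\mathcal{F}}^2 + 1} \, \sqrt{N} \tag{24} $$

from the proof of Theorem 4 and from Theorem 3 of [38].

Combining (19) with (23) and (24) we can see that (22) produces a strategy
guaranteeing (8).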
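
The generalized law of cosines (18) is the only property of Bregman divergences used in this proof, and it can be verified numerically for any differentiable $\Psi$. The sketch below does so for the negative entropy; the function names and the test points are illustrative choices, not part of the paper.

```python
import math


def bregman(psi, dpsi, y, z):
    """Bregman divergence d(y, z) = Psi(y) - Psi(z) - Psi'(z) (y - z), as in (7)."""
    return psi(y) - psi(z) - dpsi(z) * (y - z)


def law_of_cosines_rhs(psi, dpsi, y, mu, phi):
    """Right-hand side of (18): d(mu, phi) + d(y, mu) - (Psi'(phi) - Psi'(mu)) (y - mu)."""
    return (bregman(psi, dpsi, mu, phi)
            + bregman(psi, dpsi, y, mu)
            - (dpsi(phi) - dpsi(mu)) * (y - mu))


def neg_entropy(y):
    return y * math.log(y) + (1 - y) * math.log(1 - y)


def d_neg_entropy(y):
    return math.log(y / (1 - y))
```

For instance, with $y = 0.3$, $\mu = 0.6$, $\phi = 0.8$ the two sides of (18) agree up to rounding, and the same holds for the square function with derivative $2y$.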

Remark As we mentioned earlier, the leading constant in the bound of Theorem 1
(and its corollary) is worse than those in other results in this paper, in the
intersection of their domains of application. The explanation is that Theorem 1
is based on the K29 algorithm, whereas all other results are based on the more
sophisticated "K29* algorithm".

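Proof sketch of Theorem 2

The proof is similar to that of Theorem 1, with the role of the generalized law
of cosines (18) played by the equation

$$ \lambda(y, \phi) = a + \lambda(y, \mu) + b (y - \mu) \tag{25} $$

for some $a = a(\mu, \phi)$ and $b = b(\mu, \phi)$. Since $y$ can take only two
possible values, suitable $a$ and $b$ are easy to find: it suffices to solve
the linear system

$$ \begin{cases} \lambda(1, \phi) = a + \lambda(1, \mu) + b (1 - \mu) \\ \lambda(0, \phi) = a + \lambda(0, \mu) + b (-\mu). \end{cases} $$

Subtracting these equations we obtain
$b = \mathrm{Exp}(\phi) - \mathrm{Exp}(\mu)$ (abbreviating
$\mathrm{Exp}_{\lambda}$ to $\mathrm{Exp}$), which in turn gives
$a = d_{\lambda}(\mu, \phi)$. Therefore, (25) gives

$$ \left| \sum_{n=1}^{N} \lambda(y_n, \mu_n) + \sum_{n=1}^{N} d_{\lambda}(\mu_n, \phi_n) - \sum_{n=1}^{N} \lambda(y_n, \phi_n) \right| = \left| \sum_{n=1}^{N} \left( \mathrm{Exp}(\phi_n) - \mathrm{Exp}(\mu_n) \right) (y_n - \mu_n) \right| \le \left| \sum_{n=1}^{N} \mathrm{Exp}(\mu_n) (y_n - \mu_n) \right| + \left| \sum_{n=1}^{N} \mathrm{Exp}(\phi_n) (y_n - \mu_n) \right|. \tag{26} $$

There are prediction strategies that guarantee

$$ \left| \sum_{n=1}^{N} \mathrm{Exp}(\mu_n) (y_n - \mu_n) \right| \le \frac{1}{2} \|\mathrm{Exp}\|_{C(\mathbf{P})} \sqrt{N} \tag{27} $$

(cf. [37], Theorem 2) and there are prediction strategies that guarantee

$$ \left| \sum_{n=1}^{N} \mathrm{Exp}(F(s_n)) (y_n - \mu_n) \right| \le \frac{c_{\mathcal{F}}}{2} \|\mathrm{Exp}(F)\|_{\mathcal{F}} \sqrt{N} \tag{28} $$

(cf. [37], Theorem 3); merging such strategies as in [38], Corollaries 3 and 4,
we can easily obtain (11) from (26), (27), and (28).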
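
The coefficients $a$ and $b$ in (25) can be computed mechanically for any loss function on $\{0, 1\} \times \mathbf{P}$. The sketch below solves the two-equation system by the subtraction step described above and is checked on the log loss; the function names are illustrative, and the choice of the log loss and of the points $\mu = 0.3$, $\phi = 0.8$ is arbitrary.

```python
import math


def log_loss(y, mu):
    return -math.log(mu) if y == 1 else -math.log(1 - mu)


def exposure(loss, mu):
    """Exp(mu) = lambda(1, mu) - lambda(0, mu)."""
    return loss(1, mu) - loss(0, mu)


def expected_loss(loss, p, mu):
    """lambda(p, mu) = p * lambda(1, mu) + (1 - p) * lambda(0, mu)."""
    return p * loss(1, mu) + (1 - p) * loss(0, mu)


def solve_a_b(loss, mu, phi):
    """Solve lambda(y, phi) = a + lambda(y, mu) + b (y - mu) for y in {0, 1}."""
    b = exposure(loss, phi) - exposure(loss, mu)   # difference of the two equations
    a = loss(0, phi) - loss(0, mu) + b * mu        # back-substitution into the y = 0 equation
    return a, b
```

The value of `a` returned by `solve_a_b(log_loss, 0.3, 0.8)` coincides with $d_{\lambda}(0.3, 0.8) = \lambda(0.3, 0.8) - \lambda(0.3, 0.3)$, as claimed in the proof sketch.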



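Proof sketch of Theorem 3

It is shown in [31] that there is a prediction strategy guaranteeing

$$ \left| \sum_{n=1}^{N} \mathrm{Exp}(\mu_n) (y_n - \mu_n) \right| \le \sqrt{ \sum_{n=1}^{N} \mu_n (1 - \mu_n) \left( \mathrm{Exp}^2(\mu_n) + K(s_n, s_n) \right) } \tag{29} $$

and

$$ \left| \sum_{n=1}^{N} \mathrm{Exp}(F(s_n)) (y_n - \mu_n) \right| \le \|\mathrm{Exp}(F)\|_{\mathcal{F}} \sqrt{ \sum_{n=1}^{N} \mu_n (1 - \mu_n) \left( \mathrm{Exp}^2(\mu_n) + K(s_n, s_n) \right) } \tag{30} $$

(see (21), (22), and the subsection "Proof: Part II" in [31], the technical
report), where $K$ is the reproducing kernel of $\mathcal{F}$. Comparing (29)
and (30) with (26), we can see that Theorem 3 will follow from

$$ \sqrt{ \sum_{n=1}^{N} \mu_n (1 - \mu_n) \left( \mathrm{Exp}^2(\mu_n) + K(s_n, s_n) \right) } \le \frac{\sqrt{c_{\mathcal{F}}^2 + 1.8}}{2} \sqrt{N}, $$

which in turn will follow from

$$ \mu (1 - \mu) \left( \ln^2 \frac{\mu}{1 - \mu} + c_{\mathcal{F}}^2 \right) \le \frac{c_{\mathcal{F}}^2 + 1.8}{4}, \quad \mu \in (0, 1). $$

It remains to notice that $\mu (1 - \mu) \le 1/4$ and to calculate

$$ \sup_{\mu \in (0, 1)} 4 \mu (1 - \mu) \ln^2 \frac{\mu}{1 - \mu} \approx 1.76 \le 1.8. $$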
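
The numerical constant in the last step can be reproduced with a crude grid search. This is only a sanity check under the assumption that a uniform grid on $(0, 1)$ is fine enough (the function is smooth near its maximum, so the grid error is negligible); the function names are illustrative.

```python
import math


def scaled_log_odds_square(mu):
    """The function 4 mu (1 - mu) ln^2(mu / (1 - mu)) maximized at the end of the proof sketch."""
    return 4 * mu * (1 - mu) * math.log(mu / (1 - mu)) ** 2


def grid_sup(steps=10000):
    """Maximum of the function over the grid {1/steps, ..., (steps - 1)/steps}."""
    return max(scaled_log_odds_square(i / steps) for i in range(1, steps))
```

The grid maximum is attained near $\mu \approx 0.92$ (and, by symmetry, near $\mu \approx 0.08$), where the log odds are roughly $\pm 2.4$.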



Proof of Proposition 4

This proposition immediately follows from the equality in (15) and Hoeffding's
inequality (see, e.g., [?], p. 135).

7 Conclusion



The existence of master strategies (strategies whose loss is less than or close 
to the loss of any strategy with not too large a norm) can be shown for a 
very wide class of loss functions. On the contrary, leading strategies appear to 
exist for a rather narrow class of loss functions. It would be very interesting to 
delineate the class of loss functions for which a leading strategy does exist. In 
particular, does this class contain any loss functions except Bregman divergences 
and strictly proper scoring rules? 

Even if a leading strategy does not exist, one might look for a strategy G 
such that the loss of any strategy F whose norm is not too large lies between the 
loss of G plus some measure of difference between F's and G's predictions and
the loss of G plus another measure of difference between F's and G's predictions.